Project Description

Data Description: The data contains features extracted from the silhouettes of vehicles viewed from different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars. Domain: Object recognition Context: The purpose is to classify a given silhouette as one of three types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Attribute Information: ● All the features are geometric features extracted from the silhouette. ● All are numeric in nature.

Learning Outcomes: ● Exploratory Data Analysis ● Reduce the number of dimensions in the dataset with minimal information loss ● Train a model using Principal Components

Objective: Apply a dimensionality reduction technique – PCA – and train a model on the principal components instead of training it on the raw data.
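The intended workflow (scale the features, project onto principal components, then fit a classifier) can be sketched with scikit-learn's Pipeline. This is an illustrative sketch only: the synthetic data and the choice of logistic regression are assumptions, not part of this project.

```python
# Sketch of the objective: scale -> PCA -> classifier.
# Synthetic stand-in data with the same shape as the vehicle features (846 x 18);
# the logistic-regression classifier is an illustrative assumption.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=846, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),   # keep enough components for ~95% variance
    ('clf', LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print(pipe.named_steps['pca'].n_components_, round(pipe.score(X_te, y_te), 3))
```

Passing a float to `n_components` tells PCA to keep as many components as needed to reach that explained-variance ratio, which is the decision this notebook works toward.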

In [1]:
#importing some necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
dataset = pd.read_csv('vehicle.csv')
In [3]:
dataset.head()
Out[3]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [4]:
dataset.shape
Out[4]:
(846, 19)
In [5]:
dataset.describe().transpose()
Out[5]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.0 119.0
circularity 841.0 44.828775 6.152172 33.0 40.00 44.0 49.0 59.0
distance_circularity 842.0 82.110451 15.778292 40.0 70.00 80.0 98.0 112.0
radius_ratio 840.0 168.888095 33.520198 104.0 141.00 167.0 195.0 333.0
pr.axis_aspect_ratio 844.0 61.678910 7.891463 47.0 57.00 61.0 65.0 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.0 55.0
scatter_ratio 845.0 168.901775 33.214848 112.0 147.00 157.0 198.0 265.0
elongatedness 845.0 40.933728 7.816186 26.0 33.00 43.0 46.0 61.0
pr.axis_rectangularity 843.0 20.582444 2.592933 17.0 19.00 20.0 23.0 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.0 188.0
scaled_variance 843.0 188.631079 31.411004 130.0 167.00 179.0 217.0 320.0
scaled_variance.1 844.0 439.494076 176.666903 184.0 318.00 363.5 587.0 1018.0
scaled_radius_of_gyration 844.0 174.709716 32.584808 109.0 149.00 173.5 198.0 268.0
scaled_radius_of_gyration.1 842.0 72.447743 7.486190 59.0 67.00 71.5 75.0 135.0
skewness_about 840.0 6.364286 4.920649 0.0 2.00 6.0 9.0 22.0
skewness_about.1 845.0 12.602367 8.936081 0.0 5.00 11.0 19.0 41.0
skewness_about.2 845.0 188.919527 6.155809 176.0 184.00 188.0 193.0 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.0 211.0
In [6]:
dataset.dtypes
Out[6]:
compactness                      int64
circularity                    float64
distance_circularity           float64
radius_ratio                   float64
pr.axis_aspect_ratio           float64
max.length_aspect_ratio          int64
scatter_ratio                  float64
elongatedness                  float64
pr.axis_rectangularity         float64
max.length_rectangularity        int64
scaled_variance                float64
scaled_variance.1              float64
scaled_radius_of_gyration      float64
scaled_radius_of_gyration.1    float64
skewness_about                 float64
skewness_about.1               float64
skewness_about.2               float64
hollows_ratio                    int64
class                           object
dtype: object
In [7]:
# Finding the number of records for each unique value of the target variable (class)
dataset['class'].value_counts()
Out[7]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [8]:
#Boxplot to understand spread and outliers
dataset.plot(kind='box', figsize=(20,10))
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ede65b97f0>
In [9]:
dataset.hist(figsize=(15,15))
Out[9]:
(5×4 grid of AxesSubplot objects; a histogram is drawn for each of the 18 features)
In [10]:
# Checking for null values in all the attributes
dataset.isnull().sum()
Out[10]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
In [11]:
#Replace missing values with the column median, since there are many outliers; otherwise we would use the mean
for i in dataset.columns[:-1]:  # all columns except the target 'class'
    median_value = dataset[i].median()
    dataset[i] = dataset[i].fillna(median_value)
In [12]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    846 non-null float64
distance_circularity           846 non-null float64
radius_ratio                   846 non-null float64
pr.axis_aspect_ratio           846 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  846 non-null float64
elongatedness                  846 non-null float64
pr.axis_rectangularity         846 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                846 non-null float64
scaled_variance.1              846 non-null float64
scaled_radius_of_gyration      846 non-null float64
scaled_radius_of_gyration.1    846 non-null float64
skewness_about                 846 non-null float64
skewness_about.1               846 non-null float64
skewness_about.2               846 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [13]:
# checking for null values again to ensure none remain
dataset.isnull().sum()
Out[13]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [14]:
# Identifying outliers using the IQR rule and replacing them with the column median

for col_name in dataset.columns[:-1]:
    q1 = dataset[col_name].quantile(0.25)
    q3 = dataset[col_name].quantile(0.75)
    iqr = q3 - q1
    
    low = q1-1.5*iqr
    high = q3+1.5*iqr
    
    dataset.loc[(dataset[col_name] < low) | (dataset[col_name] > high), col_name] = dataset[col_name].median()
In [15]:
dataset.describe().transpose()
Out[15]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.330969 32.147908 104.0 141.00 167.0 194.75 252.0
pr.axis_aspect_ratio 846.0 61.154846 5.613458 47.0 57.00 61.0 65.00 76.0
max.length_aspect_ratio 846.0 8.118203 2.064114 3.0 7.00 8.0 10.00 13.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.430260 31.034232 130.0 167.00 179.0 216.75 288.0
scaled_variance.1 846.0 437.790780 174.346065 184.0 318.25 363.5 586.00 987.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 71.943853 6.158852 59.0 67.00 71.5 75.00 87.0
skewness_about 846.0 6.147754 4.572950 0.0 2.00 6.0 9.00 19.0
skewness_about.1 846.0 12.565012 8.877465 0.0 5.00 11.0 19.00 40.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0
In [16]:
# checking whether the outliers have been fixed
dataset.plot(kind='box', figsize=(20,10))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ede7785710>

Studying the dataset using Agglomerative Hierarchical Clustering technique

In [17]:
hdf=dataset.copy()
In [18]:
for feature in hdf.columns: # Loop through all columns in the dataframe
    if hdf[feature].dtype == 'object': # Only apply for columns with categorical strings
        hdf[feature] = pd.Categorical(hdf[feature]).codes # Replace strings with an integer
In [19]:
#importing seaborn for statistical plots
import seaborn as sns


sns.pairplot(hdf, size=7,aspect=0.5 , diag_kind='kde')
C:\Users\Asus\Anaconda3\lib\site-packages\seaborn\axisgrid.py:2065: UserWarning: The `size` parameter has been renamed to `height`; please update your code.
  warnings.warn(msg, UserWarning)
Out[19]:
<seaborn.axisgrid.PairGrid at 0x1ede9776cf8>
In [20]:
# From the diagonal panels we can infer the number of clusters by observing the number of density peaks, and from the off-diagonal panels we can choose the distance calculation method to use. In this case we choose 3 clusters.
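Reading cluster counts off KDE peaks is subjective, so the choice of three clusters can be cross-checked with the silhouette score. A minimal sketch: the toy blobs below stand in for the scaled vehicle features, purely as an illustrative assumption.

```python
# Cross-check a candidate cluster count with the silhouette score:
# the k with the highest score fits the data's grouping best.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Three well-separated 2-D blobs, 50 points each (stand-in data)
X_toy = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

scores = {}
for k in (2, 3, 4):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)
print(scores)
```

On data that genuinely has three groups, k=3 scores highest; the same loop over `hdf` would give a quantitative check of the choice made here.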

Distance calculation method used is Manhattan

In [21]:
from sklearn.cluster import AgglomerativeClustering 

model = AgglomerativeClustering(n_clusters=3, affinity='manhattan',  linkage='average')
model.fit(hdf)
Out[21]:
AgglomerativeClustering(affinity='manhattan', compute_full_tree='auto',
            connectivity=None, linkage='average', memory=None,
            n_clusters=3, pooling_func='deprecated')
In [22]:
hdf['labels'] = model.labels_

hdf.groupby(["labels"]).count()
Out[22]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
labels
0 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256 256
1 568 568 568 568 568 568 568 568 568 568 568 568 568 568 568 568 568 568 568
2 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22
In [23]:
hdf_clusters = hdf.groupby(['labels'])
In [24]:
print(hdf_clusters)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001EDE982B160>
In [25]:
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
In [26]:
from scipy.spatial.distance import pdist  # Pairwise distances between data points
In [27]:
# The cophenetic index measures the correlation between the distances of points in feature space
# and their distances on the dendrogram; the closer it is to 1, the better the clustering
# using average linkage
Z = linkage(hdf, 'average')
c, coph_dists = cophenet(Z , pdist(hdf))

c
Out[27]:
0.8382945923769921
In [28]:
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
In [29]:
# Using complete linkage
Z = linkage(hdf, 'complete')
c, coph_dists = cophenet(Z , pdist(hdf))

c
Out[29]:
0.8001459453856393
In [30]:
plt.figure(figsize=(15, 15))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=90,  leaf_font_size=10. )
plt.tight_layout()
In [31]:
# linkage as ward
Z = linkage(hdf, 'ward')
c, coph_dists = cophenet(Z , pdist(hdf))

c
Out[31]:
0.8267438402868998
In [32]:
plt.figure(figsize=(15, 15))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold=600,  leaf_font_size=10. )
plt.tight_layout()

Distance calculation method used is Euclidean

In [33]:
model = AgglomerativeClustering(n_clusters=3, affinity='euclidean')
model.fit(hdf)
Out[33]:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto',
            connectivity=None, linkage='ward', memory=None, n_clusters=3,
            pooling_func='deprecated')
In [34]:
hdf['labels'] = model.labels_

hdf.groupby(["labels"]).count()
Out[34]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
labels
0 239 239 239 239 239 239 239 239 239 239 239 239 239 239 239 239 239 239 239
1 463 463 463 463 463 463 463 463 463 463 463 463 463 463 463 463 463 463 463
2 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144
In [35]:
hdf_clusters = hdf.groupby(['labels'])
print(hdf_clusters)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001ED8E3160F0>
In [36]:
Z = linkage(hdf, 'average')
c, coph_dists = cophenet(Z , pdist(hdf))

c
Out[36]:
0.8383000495863547
In [37]:
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
In [38]:
Z = linkage(hdf, 'complete')
c, coph_dists = cophenet(Z , pdist(hdf))

c
Out[38]:
0.8001459146545137
In [39]:
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
In [40]:
Z = linkage(hdf, 'ward')
c, coph_dists = cophenet(Z , pdist(hdf))

c
Out[40]:
0.826751083554093
In [41]:
plt.figure(figsize=(10, 10))
plt.title('Agglomerative Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90.,color_threshold = 40, leaf_font_size=8. )
plt.tight_layout()
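The three linkage comparisons above (average, complete, ward) can be condensed into one loop that computes the pairwise distances once and reports the cophenetic correlation for each method. A sketch with stand-in data; on the real frame `X_toy` would simply be `hdf`:

```python
# Compare cophenetic correlation across linkage methods in one pass.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.RandomState(42)
X_toy = rng.normal(size=(60, 5))   # stand-in for hdf (illustrative assumption)
dists = pdist(X_toy)               # compute the pairwise distances only once

results = {}
for method in ('average', 'complete', 'ward'):
    Z = linkage(X_toy, method)
    results[method], _ = cophenet(Z, dists)
    print(f'{method:>8}: {results[method]:.4f}')
```

Average linkage tends to score highest on cophenetic correlation (as it did above, ~0.838), but a higher cophenetic score does not by itself make its cluster assignments more useful.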

Principal Component Analysis (PCA)

In [42]:
# Importing necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import zscore
In [43]:
pcdf=dataset.copy()
In [44]:
for feature in pcdf.columns: # Loop through all columns in the dataframe
    if pcdf[feature].dtype == 'object': # Only apply for columns with categorical strings
        pcdf[feature] = pd.Categorical(pcdf[feature]).codes # Replace strings with an integer
In [45]:
pcdf.head()
Out[45]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95.0 48.0 83.0 178.0 72.0 10.0 162.0 42.0 20.0 159.0 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197.0 2
1 91.0 41.0 84.0 141.0 57.0 9.0 149.0 45.0 19.0 143.0 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199.0 2
2 104.0 50.0 106.0 209.0 66.0 10.0 207.0 32.0 23.0 158.0 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196.0 1
3 93.0 41.0 82.0 159.0 63.0 9.0 144.0 46.0 19.0 143.0 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207.0 2
4 85.0 44.0 70.0 205.0 61.0 8.0 149.0 45.0 19.0 144.0 241.0 325.0 188.0 71.5 9.0 11.0 180.0 183.0 0
In [46]:
# Note: bus is encoded as 0, car as 1 and van as 2 in the class column
In [47]:
# separating the target column and the other independent columns
X = pcdf.iloc[:,0:18]
y = pcdf.iloc[:,18]
In [48]:
sns.pairplot(pcdf, diag_kind='kde') 
Out[48]:
<seaborn.axisgrid.PairGrid at 0x1ed929824e0>
In [50]:
# Note: from the pair plot we see that the independent columns are strongly correlated with one another, so a lot of redundant information would be fed to the model. To remove this redundancy with minimal loss of information, we use PCA.
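The redundancy the pair plot suggests can be quantified by listing strongly correlated feature pairs. A toy frame is used below as an illustrative assumption; on the real data the same idea applies to `dataset.drop(columns='class').corr()`.

```python
# List feature pairs whose absolute correlation exceeds 0.8.
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    'a': base,
    'b': 2 * base + rng.normal(scale=0.1, size=200),  # near-duplicate of 'a'
    'c': rng.normal(size=200),                        # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
stacked = upper.stack()
strong_pairs = stacked[stacked > 0.8]
print(strong_pairs)
```

Each entry in `strong_pairs` is one redundant pair; the more such pairs, the more PCA can compress the data without losing information.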
In [51]:
# We standardize the entire X (independent variable data) to z-scores and create the PCA dimensions
# on this standardized distribution.
sc = StandardScaler()
X_std =  sc.fit_transform(X)          
cov_matrix = np.cov(X_std.T)
print('Covariance Matrix \n', cov_matrix)
Covariance Matrix 
 [[ 1.00118343  0.68569786  0.79086299  0.72277977  0.1930925   0.50051942
   0.81358214 -0.78968322  0.81465658  0.67694334  0.77078163  0.80712401
   0.58593517 -0.24697246  0.19754181  0.1565327   0.29889034  0.36598446]
 [ 0.68569786  1.00118343  0.79325751  0.63903532  0.20349327  0.5611334
   0.8489411  -0.82244387  0.84439802  0.96245572  0.80371846  0.82844154
   0.92691166  0.06882659  0.13651201 -0.00967793 -0.10455005  0.04640562]
 [ 0.79086299  0.79325751  1.00118343  0.79516215  0.24462154  0.66759792
   0.90614687 -0.9123854   0.89408198  0.77544391  0.87061349  0.88498924
   0.70660663 -0.22962442  0.09922417  0.26265581  0.14627113  0.33312625]
 [ 0.72277977  0.63903532  0.79516215  1.00118343  0.65132393  0.46450748
   0.77085211 -0.82636872  0.74502008  0.58015378  0.78711387  0.76115704
   0.55142559 -0.39092105  0.03579728  0.17981316  0.40632957  0.49234013]
 [ 0.1930925   0.20349327  0.24462154  0.65132393  1.00118343  0.15047265
   0.19442484 -0.29849719  0.16323988  0.14776643  0.20734569  0.19663295
   0.14876723 -0.32144977 -0.05609621 -0.02111342  0.401356    0.41622574]
 [ 0.50051942  0.5611334   0.66759792  0.46450748  0.15047265  1.00118343
   0.49133933 -0.50477756  0.48850876  0.64347365  0.40186618  0.46379685
   0.39786723 -0.33584133  0.08199536  0.14183116  0.08389276  0.41366325]
 [ 0.81358214  0.8489411   0.90614687  0.77085211  0.19442484  0.49133933
   1.00118343 -0.97275069  0.99092181  0.81004084  0.96201996  0.98160681
   0.80082111  0.01132718  0.06431825  0.21189733  0.00563439  0.1189581 ]
 [-0.78968322 -0.82244387 -0.9123854  -0.82636872 -0.29849719 -0.50477756
  -0.97275069  1.00118343 -0.95011894 -0.77677186 -0.94876596 -0.94997386
  -0.76722075  0.07848365 -0.04699819 -0.18385891 -0.11526213 -0.2171615 ]
 [ 0.81465658  0.84439802  0.89408198  0.74502008  0.16323988  0.48850876
   0.99092181 -0.95011894  1.00118343  0.81189327  0.94845027  0.97475823
   0.79763248  0.02757736  0.07321311  0.21405404 -0.01867064  0.09940372]
 [ 0.67694334  0.96245572  0.77544391  0.58015378  0.14776643  0.64347365
   0.81004084 -0.77677186  0.81189327  1.00118343  0.75110957  0.79056684
   0.86747579  0.05391989  0.13085669  0.00413356 -0.10407076  0.07686047]
 [ 0.77078163  0.80371846  0.87061349  0.78711387  0.20734569  0.40186618
   0.96201996 -0.94876596  0.94845027  0.75110957  1.00118343  0.94489677
   0.78600191  0.02585841  0.02472235  0.19735505  0.01518932  0.08643233]
 [ 0.80712401  0.82844154  0.88498924  0.76115704  0.19663295  0.46379685
   0.98160681 -0.94997386  0.97475823  0.79056684  0.94489677  1.00118343
   0.78389866  0.00939688  0.0658085   0.20518392  0.01757781  0.11978365]
 [ 0.58593517  0.92691166  0.70660663  0.55142559  0.14876723  0.39786723
   0.80082111 -0.76722075  0.79763248  0.86747579  0.78600191  0.78389866
   1.00118343  0.21553366  0.16316265 -0.05573322 -0.22471583 -0.11814142]
 [-0.24697246  0.06882659 -0.22962442 -0.39092105 -0.32144977 -0.33584133
   0.01132718  0.07848365  0.02757736  0.05391989  0.02585841  0.00939688
   0.21553366  1.00118343 -0.05782288 -0.12414277 -0.83372383 -0.90239877]
 [ 0.19754181  0.13651201  0.09922417  0.03579728 -0.05609621  0.08199536
   0.06431825 -0.04699819  0.07321311  0.13085669  0.02472235  0.0658085
   0.16316265 -0.05782288  1.00118343 -0.04178316  0.0867631   0.06269293]
 [ 0.1565327  -0.00967793  0.26265581  0.17981316 -0.02111342  0.14183116
   0.21189733 -0.18385891  0.21405404  0.00413356  0.19735505  0.20518392
  -0.05573322 -0.12414277 -0.04178316  1.00118343  0.07456104  0.20088894]
 [ 0.29889034 -0.10455005  0.14627113  0.40632957  0.401356    0.08389276
   0.00563439 -0.11526213 -0.01867064 -0.10407076  0.01518932  0.01757781
  -0.22471583 -0.83372383  0.0867631   0.07456104  1.00118343  0.89363767]
 [ 0.36598446  0.04640562  0.33312625  0.49234013  0.41622574  0.41366325
   0.1189581  -0.2171615   0.09940372  0.07686047  0.08643233  0.11978365
  -0.11814142 -0.90239877  0.06269293  0.20088894  0.89363767  1.00118343]]
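Note that the diagonal entries read 1.00118343 rather than exactly 1: StandardScaler standardizes with the population standard deviation (ddof=0), while np.cov normalizes by N-1, so each diagonal entry equals N/(N-1) = 846/845. A small demonstration:

```python
# Why the diagonal of the covariance matrix is 1.00118343, not 1.
import numpy as np

N = 846
x = np.random.RandomState(1).normal(size=N)
x_std = (x - x.mean()) / x.std()   # ddof=0 standardization, as StandardScaler does
print(float(np.cov(x_std)))        # np.cov uses ddof=1 by default
print(N / (N - 1))                 # both equal 846/845 ≈ 1.0011834
```

The mismatch is harmless here: it scales every eigenvalue by the same constant and leaves the eigenvectors and variance ratios unchanged.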
In [52]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n', eigenvectors)
print('\n Eigen Values \n', eigenvalues)
Eigen Vectors 
 [[-2.72502890e-01 -8.70435783e-02  3.81852075e-02  1.38675013e-01
  -1.37101466e-01  2.63611383e-01  2.02717114e-01 -7.58796410e-01
   3.66685918e-01  1.60045219e-01  8.40252779e-02  2.14645175e-02
  -1.87350749e-02  6.89082276e-02  4.26105276e-02  9.97784975e-02
  -8.22590084e-02 -3.30366937e-02]
 [-2.87254690e-01  1.31621757e-01  2.01146908e-01 -3.80554832e-02
   1.38995553e-01 -7.13474241e-02 -3.92275358e-01 -6.76034223e-02
   5.53261885e-02 -1.82323962e-01 -3.65229874e-02  1.47247511e-01
  -4.89102355e-02  5.90534770e-02 -6.74107885e-01  1.63466948e-01
  -2.59100771e-01  2.48832011e-01]
 [-3.02421105e-01 -4.61430061e-02 -6.34621085e-02  1.08954287e-01
   8.00174278e-02 -1.69006151e-02  1.63371282e-01  2.77371950e-01
   7.46784853e-02  2.73033778e-01  4.68505530e-01  6.52730855e-01
   4.74162132e-03 -1.62108150e-01 -4.99754439e-04 -6.36582307e-02
   1.20629778e-01  9.80561531e-02]
 [-2.69713545e-01 -1.97931263e-01 -5.62851689e-02 -2.54355087e-01
  -1.33744367e-01 -1.38183653e-01  1.61910525e-01  1.10544748e-01
   2.66666666e-01 -5.05987218e-02 -5.45526034e-01  7.52188680e-02
   3.70499547e-03 -3.93288246e-01  1.74861248e-01 -1.33284415e-01
  -1.86241567e-01  3.60765151e-01]
 [-9.78607336e-02 -2.57839952e-01  6.19927464e-02 -6.12765722e-01
  -1.23601456e-01 -5.77828612e-01  9.27633094e-02 -1.86858758e-01
  -3.86296562e-02 -3.43037888e-02  2.65023238e-01 -2.40287269e-02
   8.90928349e-03  1.63771153e-01 -6.31976228e-02  2.14665592e-02
   1.24639367e-01 -1.77647590e-01]
 [-1.95200137e-01 -1.08045626e-01  1.48957820e-01  2.78678159e-01
   6.34893356e-01 -2.89096995e-01  3.98266293e-01 -4.62187969e-02
  -1.37163365e-01  1.77960797e-01 -1.92846020e-01 -2.29741488e-01
   4.09727876e-03  1.36576102e-01 -9.62482815e-02 -6.89934316e-02
   1.40804371e-01  9.99006987e-02]
 [-3.10523932e-01  7.52853487e-02 -1.09042833e-01  5.39294828e-03
  -8.55574543e-02  9.77471088e-02  9.23519412e-02  6.46204209e-02
  -1.31567659e-01 -1.43132644e-01  9.67172431e-02 -1.53118496e-01
   8.55513044e-01  6.48917601e-02 -4.36596954e-02 -1.56585696e-01
  -1.43109720e-01 -5.28457504e-02]
 [ 3.09006904e-01 -1.32299375e-02  9.08526930e-02  6.52148575e-02
   7.90734442e-02 -7.57282937e-02 -1.04070600e-01 -1.92342823e-01
   2.89633509e-01 -7.93831124e-02 -2.29926427e-02  2.33454000e-02
   2.61858734e-01 -4.96273257e-01 -3.08568675e-01 -2.44030327e-01
   5.11966770e-01 -9.49906147e-02]
 [-3.07287000e-01  8.75601978e-02 -1.06070496e-01  3.08991500e-02
  -8.16463820e-02  1.05403228e-01  9.31317767e-02  1.38684573e-02
  -8.95291026e-02 -2.39896699e-01  1.59356923e-01 -2.17636238e-01
  -4.22479708e-01 -1.13664100e-01 -1.63739102e-01 -6.71547392e-01
  -6.75916711e-02 -2.16727165e-01]
 [-2.78154157e-01  1.22154240e-01  2.13684693e-01  4.14674720e-02
   2.51112937e-01 -7.81962142e-02 -3.54564344e-01 -2.15163418e-01
  -1.58231983e-01 -3.82739482e-01 -1.42837015e-01  3.15261003e-01
   2.00493082e-02 -8.66067604e-03  5.08763287e-01 -5.00643538e-02
   1.60926059e-01 -2.00262071e-01]
 [-2.99765086e-01  7.72657535e-02 -1.44599805e-01 -6.40050869e-02
  -1.47471227e-01  1.32912405e-01  6.80546125e-02  1.95678724e-01
   4.27034669e-02  1.66090908e-01 -4.59667614e-01  1.18383161e-01
  -4.15194745e-02  1.35985919e-01 -2.52182911e-01  2.17416166e-01
   3.24139804e-01 -5.53139002e-01]
 [-3.05532374e-01  7.15030171e-02 -1.10343735e-01 -2.19687048e-03
  -1.10100984e-01  1.15398218e-01  9.01194270e-02  3.77948210e-02
  -1.51072666e-01 -2.87457686e-01  2.09345615e-01 -3.31340876e-01
  -1.22365190e-01 -2.42922436e-01  3.94502237e-02  4.48936624e-01
   4.62827872e-01  3.22499534e-01]
 [-2.63237620e-01  2.10582046e-01  2.02870191e-01 -8.55396458e-02
   5.21210685e-03 -6.70573978e-02 -4.55292717e-01  1.46752664e-01
   2.63771332e-01  5.49626527e-01  1.07713508e-01 -3.99260390e-01
   1.66056546e-02 -3.30876118e-02  2.03029913e-01 -1.06621517e-01
   8.55669069e-02  2.40609291e-02]
 [ 4.19359352e-02  5.03621577e-01 -7.38640211e-02 -1.15399624e-01
  -1.38068605e-01 -1.31513077e-01  8.58226790e-02 -3.30394999e-01
  -5.55267166e-01  3.62547303e-01 -1.26596148e-01  1.21942784e-01
   1.27186667e-03 -2.96030848e-01 -5.79407509e-02 -3.08034829e-02
  -5.10909842e-02  8.79644677e-02]
 [-3.60832115e-02 -1.57663214e-02  5.59173987e-01  4.73703309e-01
  -5.66552244e-01 -3.19176094e-01  1.24532179e-01  1.14255395e-01
  -5.99039250e-02 -5.79891873e-02 -3.25785780e-02  2.88590518e-03
  -4.24341185e-04  4.01635562e-03 -8.22261600e-03  2.05544442e-02
  -4.39201991e-03 -3.76172016e-02]
 [-5.87204797e-02 -9.27462386e-02 -6.70680496e-01  4.28426032e-01
  -1.30869817e-01 -4.68404967e-01 -3.02517700e-01 -1.15403870e-01
   5.23845772e-02  1.28995278e-02 -3.62255133e-02 -1.62495314e-02
  -9.40554994e-03  8.00562035e-02  1.12172401e-02 -2.31296836e-03
   1.13702813e-02  4.44850199e-02]
 [-3.80131449e-02 -5.01621218e-01  6.22407145e-02 -2.74095968e-02
  -1.80519293e-01  2.80136438e-01 -2.58250261e-01 -9.46599623e-02
  -3.79168935e-01  1.87848521e-01 -1.38657118e-01  8.24506703e-02
   2.60800892e-02  2.45816461e-01 -7.88567114e-02 -2.81093089e-01
   3.19960307e-01  3.19055407e-01]
 [-8.47399995e-02 -5.07612106e-01  4.17053530e-02  9.60374943e-02
   1.10788067e-01  5.94444089e-02 -1.73269228e-01 -6.49718344e-03
  -2.80340510e-01  1.33402674e-01  8.39926899e-02 -1.29951586e-01
  -4.18109835e-03 -5.18420304e-01 -3.18514877e-02  2.41164948e-01
  -3.10989286e-01 -3.65128378e-01]]

 Eigen Values 
 [9.74940269e+00 3.35071912e+00 1.19238155e+00 1.13381916e+00
 8.83997312e-01 6.66265745e-01 3.18150910e-01 2.28179142e-01
 1.31018595e-01 7.98619108e-02 7.33979478e-02 6.46162669e-02
 5.16287320e-03 4.01448646e-02 1.98136761e-02 2.27005257e-02
 3.22758478e-02 2.93936408e-02]
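The eigenvalues translate directly into explained-variance ratios, which guide how many principal components to keep. A sketch using the values printed above (note that np.linalg.eig does not return them sorted):

```python
# How many components are needed to retain ~95% of the variance?
import numpy as np

# Eigenvalues copied from the output above
eigenvalues = np.array([9.74940269e+00, 3.35071912e+00, 1.19238155e+00,
                        1.13381916e+00, 8.83997312e-01, 6.66265745e-01,
                        3.18150910e-01, 2.28179142e-01, 1.31018595e-01,
                        7.98619108e-02, 7.33979478e-02, 6.46162669e-02,
                        5.16287320e-03, 4.01448646e-02, 1.98136761e-02,
                        2.27005257e-02, 3.22758478e-02, 2.93936408e-02])

ratios = np.sort(eigenvalues)[::-1] / eigenvalues.sum()
cumulative = np.cumsum(ratios)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)
```

The first component alone explains over half the variance, and a handful of components reach 95%, confirming that most of the 18 dimensions are redundant.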
In [53]:
# Sort the eigenvalues in descending order

# Make a list of (eigenvalue, eigenvector) pairs
eig_pairs = [(eigenvalues[index], eigenvectors[:, index]) for index in range(len(eigenvalues))]

# Sort the pairs from highest to lowest eigenvalue; sorting on the eigenvalue
# alone avoids falling back to comparing the eigenvector arrays
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)
print(eig_pairs)

# Extract the descending ordered eigenvalues and eigenvectors
eigvalues_sorted = [eig_pairs[index][0] for index in range(len(eigenvalues))]
eigvectors_sorted = [eig_pairs[index][1] for index in range(len(eigenvalues))]

# Let's confirm our sorting worked, print out eigenvalues
print('Eigenvalues in descending order: \n%s' %eigvalues_sorted)
[(9.749402689379597, array([-0.27250289, -0.28725469, -0.30242111, -0.26971354, -0.09786073,
       -0.19520014, -0.31052393,  0.3090069 , -0.307287  , -0.27815416,
       -0.29976509, -0.30553237, -0.26323762,  0.04193594, -0.03608321,
       -0.05872048, -0.03801314, -0.08474   ])), (3.350719119412978, array([-0.08704358,  0.13162176, -0.04614301, -0.19793126, -0.25783995,
       -0.10804563,  0.07528535, -0.01322994,  0.0875602 ,  0.12215424,
        0.07726575,  0.07150302,  0.21058205,  0.50362158, -0.01576632,
       -0.09274624, -0.50162122, -0.50761211])), (1.1923815452731639, array([ 0.03818521,  0.20114691, -0.06346211, -0.05628517,  0.06199275,
        0.14895782, -0.10904283,  0.09085269, -0.1060705 ,  0.21368469,
       -0.1445998 , -0.11034374,  0.20287019, -0.07386402,  0.55917399,
       -0.6706805 ,  0.06224071,  0.04170535])), (1.1338191632147838, array([ 0.13867501, -0.03805548,  0.10895429, -0.25435509, -0.61276572,
        0.27867816,  0.00539295,  0.06521486,  0.03089915,  0.04146747,
       -0.06400509, -0.00219687, -0.08553965, -0.11539962,  0.47370331,
        0.42842603, -0.0274096 ,  0.09603749])), (0.8839973120036095, array([-0.13710147,  0.13899555,  0.08001743, -0.13374437, -0.12360146,
        0.63489336, -0.08555745,  0.07907344, -0.08164638,  0.25111294,
       -0.14747123, -0.11010098,  0.00521211, -0.1380686 , -0.56655224,
       -0.13086982, -0.18051929,  0.11078807])), (0.6662657454310769, array([ 0.26361138, -0.07134742, -0.01690062, -0.13818365, -0.57782861,
       -0.289097  ,  0.09774711, -0.07572829,  0.10540323, -0.07819621,
        0.1329124 ,  0.11539822, -0.0670574 , -0.13151308, -0.31917609,
       -0.46840497,  0.28013644,  0.05944441])), (0.31815090958438486, array([ 0.20271711, -0.39227536,  0.16337128,  0.16191053,  0.09276331,
        0.39826629,  0.09235194, -0.1040706 ,  0.09313178, -0.35456434,
        0.06805461,  0.09011943, -0.45529272,  0.08582268,  0.12453218,
       -0.3025177 , -0.25825026, -0.17326923])), (0.2281791421155407, array([-0.75879641, -0.06760342,  0.27737195,  0.11054475, -0.18685876,
       -0.0462188 ,  0.06462042, -0.19234282,  0.01386846, -0.21516342,
        0.19567872,  0.03779482,  0.14675266, -0.330395  ,  0.1142554 ,
       -0.11540387, -0.09465996, -0.00649718])), (0.13101859512585473, array([ 0.36668592,  0.05532619,  0.07467849,  0.26666667, -0.03862966,
       -0.13716337, -0.13156766,  0.28963351, -0.0895291 , -0.15823198,
        0.04270347, -0.15107267,  0.26377133, -0.55526717, -0.05990393,
        0.05238458, -0.37916894, -0.28034051])), (0.07986191082036508, array([ 0.16004522, -0.18232396,  0.27303378, -0.05059872, -0.03430379,
        0.1779608 , -0.14313264, -0.07938311, -0.2398967 , -0.38273948,
        0.16609091, -0.28745769,  0.54962653,  0.3625473 , -0.05798919,
        0.01289953,  0.18784852,  0.13340267])), (0.07339794782509106, array([ 0.08402528, -0.03652299,  0.46850553, -0.54552603,  0.26502324,
       -0.19284602,  0.09671724, -0.02299264,  0.15935692, -0.14283702,
       -0.45966761,  0.20934562,  0.10771351, -0.12659615, -0.03257858,
       -0.03622551, -0.13865712,  0.08399269])), (0.06461626687535525, array([ 0.02146452,  0.14724751,  0.65273085,  0.07521887, -0.02402873,
       -0.22974149, -0.1531185 ,  0.0233454 , -0.21763624,  0.315261  ,
        0.11838316, -0.33134088, -0.39926039,  0.12194278,  0.00288591,
       -0.01624953,  0.08245067, -0.12995159])), (0.04014486457709953, array([ 0.06890823,  0.05905348, -0.16210815, -0.39328825,  0.16377115,
        0.1365761 ,  0.06489176, -0.49627326, -0.1136641 , -0.00866068,
        0.13598592, -0.24292244, -0.03308761, -0.29603085,  0.00401636,
        0.0800562 ,  0.24581646, -0.5184203 ])), (0.032275847766898305, array([-0.08225901, -0.25910077,  0.12062978, -0.18624157,  0.12463937,
        0.14080437, -0.14310972,  0.51196677, -0.06759167,  0.16092606,
        0.3241398 ,  0.46282787,  0.08556691, -0.05109098, -0.00439202,
        0.01137028,  0.31996031, -0.31098929])), (0.02939364075031221, array([-0.03303669,  0.24883201,  0.09805615,  0.36076515, -0.17764759,
        0.0999007 , -0.05284575, -0.09499061, -0.21672717, -0.20026207,
       -0.553139  ,  0.32249953,  0.02406093,  0.08796447, -0.0376172 ,
        0.04448502,  0.31905541, -0.36512838])), (0.022700525706219703, array([ 0.0997785 ,  0.16346695, -0.06365823, -0.13328441,  0.02146656,
       -0.06899343, -0.1565857 , -0.24403033, -0.67154739, -0.05006435,
        0.21741617,  0.44893662, -0.10662152, -0.03080348,  0.02055444,
       -0.00231297, -0.28109309,  0.24116495])), (0.019813676080863922, array([ 4.26105276e-02, -6.74107885e-01, -4.99754439e-04,  1.74861248e-01,
       -6.31976228e-02, -9.62482815e-02, -4.36596954e-02, -3.08568675e-01,
       -1.63739102e-01,  5.08763287e-01, -2.52182911e-01,  3.94502237e-02,
        2.03029913e-01, -5.79407509e-02, -8.22261600e-03,  1.12172401e-02,
       -7.88567114e-02, -3.18514877e-02])), (0.0051628732047457404, array([-1.87350749e-02, -4.89102355e-02,  4.74162132e-03,  3.70499547e-03,
        8.90928349e-03,  4.09727876e-03,  8.55513044e-01,  2.61858734e-01,
       -4.22479708e-01,  2.00493082e-02, -4.15194745e-02, -1.22365190e-01,
        1.66056546e-02,  1.27186667e-03, -4.24341185e-04, -9.40554994e-03,
        2.60800892e-02, -4.18109835e-03]))]
Eigenvalues in descending order: 
[9.749402689379597, 3.350719119412978, 1.1923815452731639, 1.1338191632147838, 0.8839973120036095, 0.6662657454310769, 0.31815090958438486, 0.2281791421155407, 0.13101859512585473, 0.07986191082036508, 0.07339794782509106, 0.06461626687535525, 0.04014486457709953, 0.032275847766898305, 0.02939364075031221, 0.022700525706219703, 0.019813676080863922, 0.0051628732047457404]
In [54]:
tot = sum(eigenvalues)
var_explained = [(i / tot) for i in sorted(eigenvalues, reverse=True)]  # variance explained by each
# eigenvector... there will be 18 entries as there are 18 eigenvectors
cum_var_exp = np.cumsum(var_explained)  # cumulative variance; the 18th entry
# reaches almost 100%
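The manual eigendecomposition route can be cross-checked against scikit-learn's PCA, which exposes the same explained-variance ratios directly. A minimal sketch, using small random stand-in data in place of the notebook's `X_std`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
X_std = rng.randn(100, 18)             # stand-in for the standardized features

pca = PCA(n_components=18).fit(X_std)
var_explained = pca.explained_variance_ratio_  # already sorted descending
cum_var_exp = np.cumsum(var_explained)

print(round(cum_var_exp[-1], 6))       # all 18 components explain the full variance
```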
In [55]:
cum_var_exp.size
Out[55]:
18
In [56]:
plt.bar(range(1,19), var_explained, alpha=0.5, align='center', label='individual explained variance')
plt.step(range(1,19),cum_var_exp, where= 'mid', label='cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal components')
plt.legend(loc = 'best')
plt.show()
In [57]:
# From the graph we find that most of the information is contained in the first 6 principal components
# P_reduce represents the reduced mathematical space

P_reduce = np.array(eigvectors_sorted[0:6])   # reducing from 18 to 6 dimensions

X_std_6D = np.dot(X_std, P_reduce.T)   # projecting the original data onto the principal components

Proj_data_df = pd.DataFrame(X_std_6D)  # converting the array to a dataframe for pairplot
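For comparison, the same 6-D projection can be obtained with scikit-learn's PCA: `fit_transform` equals the dot product with the top-6 components after centering. A sketch, again using random stand-in data rather than the notebook's `X_std`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
X_std = rng.randn(100, 18)                 # stand-in for the standardized features

pca = PCA(n_components=6)
X_6d = pca.fit_transform(X_std)            # project onto the top-6 components

# pca.components_ holds the top-6 eigenvectors as rows, so the manual
# projection used in the notebook reproduces fit_transform exactly:
manual = (X_std - pca.mean_) @ pca.components_.T
print(np.allclose(X_6d, manual))           # True
```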
In [58]:
#Let us check it visually


sns.pairplot(Proj_data_df, diag_kind='kde') 
Out[58]:
<seaborn.axisgrid.PairGrid at 0x1ed9b4be358>
In [59]:
# As expected, the off-diagonal scatter in the projected space is now roughly spherical: the principal components are uncorrelated
In [60]:
from sklearn import model_selection

test_size = 0.30 # taking a 70:30 train/test split
seed = 7  # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
In [61]:
## We will use the Naive Bayes & Support Vector Classifiers
In [62]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
In [63]:
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV

model = SVC()

params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}

model1 = GridSearchCV(model, param_grid=params, verbose=5)

model1.fit(X_train, y_train)

print("Best Hyper Parameters:\n", model1.best_params_)
C:\Users\Asus\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
C:\Users\Asus\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.7878787878787878, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.7575757575757576, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] ... C=0.01, kernel=linear, score=0.826530612244898, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ..... C=0.01, kernel=rbf, score=0.5102040816326531, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.7929292929292929, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.7676767676767676, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.8214285714285714, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.7070707070707071, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.7222222222222222, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.7295918367346939, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.8131313131313131, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.7828282828282829, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.8316326530612245, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.8484848484848485, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.8686868686868687, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.9183673469387755, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.8131313131313131, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.7777777777777778, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.8367346938775511, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.8636363636363636, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.8636363636363636, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9336734693877551, total=   0.0s
Best Hyper Parameters:
 {'C': 1, 'kernel': 'rbf'}
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
In [64]:
# Best Hyper Parameters:
# {'C': 1, 'kernel': 'rbf'}
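The FutureWarnings in the log come from relying on defaults that changed in scikit-learn 0.22 (`cv` and `gamma`). Passing both explicitly keeps the search reproducible across versions; a sketch on synthetic stand-in data rather than the projected vehicle features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic 3-class stand-in for the projected vehicle data
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=7)

params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(gamma='scale'),   # explicit gamma: no FutureWarning
                    param_grid=params,
                    cv=5)                 # explicit cv: no FutureWarning
grid.fit(X, y)
print(grid.best_params_)
```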
In [65]:
# calculate accuracy measures and confusion matrix
from sklearn import metrics
In [66]:
#Build the model with the best hyper parameters
svc_model = SVC(C=1, kernel="rbf")

# Fitting the model
svc_model.fit(X_train, y_train)

#Prediction on test set
prediction = svc_model.predict(X_test)

# Accuracy on test set
accuracy =  svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
Classification report
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        66
           1       0.96      0.92      0.94       127
           2       0.86      0.93      0.90        61

   micro avg       0.94      0.94      0.94       254
   macro avg       0.93      0.94      0.94       254
weighted avg       0.94      0.94      0.94       254

Confusion matrix
[[ 64   2   0]
 [  1 117   9]
 [  1   3  57]]
Overall score  0.937007874015748
In [67]:
# We get an overall score of 93.7% on the test set, with good recall across all three classes
In [68]:
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    #Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy =  model.score(X_test, y_test)
    expected=y_test
    print("Iteration ",itr)
    itr=itr+1
    print()
    print("data split random state ",seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ",accuracy)
    print("----------------------------------------------------")
Iteration  1

data split random state  57
Classification report
              precision    recall  f1-score   support

           0       0.78      0.55      0.65        65
           1       0.84      0.84      0.84       138
           2       0.54      0.75      0.63        51

   micro avg       0.75      0.75      0.75       254
   macro avg       0.72      0.71      0.71       254
weighted avg       0.77      0.75      0.75       254

Confusion matrix
[[ 36  11  18]
 [  8 116  14]
 [  2  11  38]]
Overall score  0.7480314960629921
----------------------------------------------------
Iteration  2

data split random state  52
Classification report
              precision    recall  f1-score   support

           0       0.71      0.71      0.71        59
           1       0.83      0.80      0.82       135
           2       0.71      0.77      0.74        60

   micro avg       0.77      0.77      0.77       254
   macro avg       0.75      0.76      0.75       254
weighted avg       0.77      0.77      0.77       254

Confusion matrix
[[ 42   9   8]
 [ 16 108  11]
 [  1  13  46]]
Overall score  0.7716535433070866
----------------------------------------------------
Iteration  3

data split random state  73
Classification report
              precision    recall  f1-score   support

           0       0.77      0.75      0.76        67
           1       0.83      0.82      0.82       132
           2       0.66      0.71      0.68        55

   micro avg       0.78      0.78      0.78       254
   macro avg       0.75      0.76      0.76       254
weighted avg       0.78      0.78      0.78       254

Confusion matrix
[[ 50   9   8]
 [ 12 108  12]
 [  3  13  39]]
Overall score  0.7755905511811023
----------------------------------------------------
Iteration  4

data split random state  11
Classification report
              precision    recall  f1-score   support

           0       0.71      0.71      0.71        59
           1       0.84      0.83      0.83       134
           2       0.65      0.67      0.66        61

   micro avg       0.76      0.76      0.76       254
   macro avg       0.73      0.74      0.74       254
weighted avg       0.77      0.76      0.76       254

Confusion matrix
[[ 42   7  10]
 [ 11 111  12]
 [  6  14  41]]
Overall score  0.7637795275590551
----------------------------------------------------
Iteration  5

data split random state  55
Classification report
              precision    recall  f1-score   support

           0       0.83      0.72      0.77        72
           1       0.83      0.82      0.82       131
           2       0.63      0.76      0.69        51

   micro avg       0.78      0.78      0.78       254
   macro avg       0.76      0.77      0.76       254
weighted avg       0.79      0.78      0.78       254

Confusion matrix
[[ 52  10  10]
 [ 11 107  13]
 [  0  12  39]]
Overall score  0.7795275590551181
----------------------------------------------------
Iteration  6

data split random state  87
Classification report
              precision    recall  f1-score   support

           0       0.85      0.66      0.74        67
           1       0.82      0.90      0.86       120
           2       0.70      0.73      0.72        67

   micro avg       0.79      0.79      0.79       254
   macro avg       0.79      0.76      0.77       254
weighted avg       0.79      0.79      0.79       254

Confusion matrix
[[ 44   8  15]
 [  6 108   6]
 [  2  16  49]]
Overall score  0.7913385826771654
----------------------------------------------------
Iteration  7

data split random state  36
Classification report
              precision    recall  f1-score   support

           0       0.79      0.83      0.81        53
           1       0.82      0.86      0.84       132
           2       0.72      0.64      0.68        69

   micro avg       0.79      0.79      0.79       254
   macro avg       0.78      0.77      0.77       254
weighted avg       0.79      0.79      0.79       254

Confusion matrix
[[ 44   4   5]
 [  7 113  12]
 [  5  20  44]]
Overall score  0.7913385826771654
----------------------------------------------------
Iteration  8

data split random state  74
Classification report
              precision    recall  f1-score   support

           0       0.73      0.61      0.67        67
           1       0.81      0.78      0.80       129
           2       0.65      0.83      0.73        58

   micro avg       0.75      0.75      0.75       254
   macro avg       0.73      0.74      0.73       254
weighted avg       0.75      0.75      0.75       254

Confusion matrix
[[ 41  14  12]
 [ 14 101  14]
 [  1   9  48]]
Overall score  0.7480314960629921
----------------------------------------------------
Iteration  9

data split random state  37
Classification report
              precision    recall  f1-score   support

           0       0.73      0.74      0.74        62
           1       0.86      0.78      0.81       130
           2       0.67      0.79      0.73        62

   micro avg       0.77      0.77      0.77       254
   macro avg       0.75      0.77      0.76       254
weighted avg       0.78      0.77      0.77       254

Confusion matrix
[[ 46   6  10]
 [ 15 101  14]
 [  2  11  49]]
Overall score  0.7716535433070866
----------------------------------------------------
Iteration  10

data split random state  8
Classification report
              precision    recall  f1-score   support

           0       0.86      0.69      0.77        74
           1       0.79      0.86      0.82       115
           2       0.64      0.68      0.66        65

   micro avg       0.76      0.76      0.76       254
   macro avg       0.76      0.74      0.75       254
weighted avg       0.77      0.76      0.76       254

Confusion matrix
[[51  8 15]
 [ 6 99 10]
 [ 2 19 44]]
Overall score  0.7637795275590551
----------------------------------------------------
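The seed loop above approximates cross-validation by re-splitting the data ten times; `cross_val_score` gives the same accuracy-across-resamples picture in a single call. A sketch on synthetic stand-in data rather than the projected vehicle features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class stand-in for the projected vehicle data
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=7)

scores = cross_val_score(GaussianNB(), X, y, cv=10)
print("mean accuracy %.3f +/- %.3f" % (scores.mean(), scores.std()))
```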
In [69]:
# For Naive Bayes, the highest overall score is 79%, for data split random state = 36

Now increasing the number of principal components from 6 to 9

In [70]:
P_reduce = np.array(eigvectors_sorted[0:9])   # increasing from 6 to 9 dimensions

X_std_9D = np.dot(X_std, P_reduce.T)   # projecting the original data onto the principal components

Proj_data_df = pd.DataFrame(X_std_9D)  # converting the array to a dataframe for pairplot
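Rather than hand-picking 6 and then 9 components, PCA can select the smallest number of components that covers a target fraction of the variance: a float `n_components` is interpreted as a variance threshold. A sketch with random stand-in data in place of the notebook's `X_std`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(7)
X_std = rng.randn(100, 18)                # stand-in for the standardized features

pca = PCA(n_components=0.95)              # keep enough components for 95% variance
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape[1], "components retained")
print(round(pca.explained_variance_ratio_.sum(), 3))
```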
In [71]:
sns.pairplot(Proj_data_df, diag_kind='kde') 
Out[71]:
<seaborn.axisgrid.PairGrid at 0x1ed98d976a0>
In [72]:
from sklearn import model_selection

test_size = 0.30 # taking a 70:30 train/test split
seed = 7  # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
In [73]:
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV

model = SVC()

params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}

model1 = GridSearchCV(model, param_grid=params, verbose=5)

model1.fit(X_train, y_train)

print("Best Hyper Parameters:\n", model1.best_params_)
C:\Users\Asus\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
C:\Users\Asus\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.8333333333333334, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.8232323232323232, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.8622448979591837, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ..... C=0.01, kernel=rbf, score=0.5102040816326531, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.9141414141414141, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.8535353535353535, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.8775510204081632, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.7626262626262627, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.8181818181818182, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.7806122448979592, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.9191919191919192, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.8585858585858586, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.8979591836734694, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.9191919191919192, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.9141414141414141, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ....... C=0.5, kernel=rbf, score=0.923469387755102, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.9040404040404041, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.8686868686868687, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.8826530612244898, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9191919191919192, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9191919191919192, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9183673469387755, total=   0.0s
Best Hyper Parameters:
 {'C': 0.5, 'kernel': 'rbf'}
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.1s finished
In [74]:
# Best Hyper Parameters:
# {'C': 0.5, 'kernel': 'rbf'}
In [75]:
#Build the model with the best hyper parameters
svc_model = SVC(C=0.5, kernel="rbf")

# Fitting the model
svc_model.fit(X_train, y_train)

#Prediction on test set
prediction = svc_model.predict(X_test)

# Accuracy on test set
accuracy =  svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
Classification report
              precision    recall  f1-score   support

           0       0.97      0.98      0.98        66
           1       0.98      0.98      0.98       127
           2       0.97      0.93      0.95        61

   micro avg       0.97      0.97      0.97       254
   macro avg       0.97      0.97      0.97       254
weighted avg       0.97      0.97      0.97       254

Confusion matrix
[[ 65   1   0]
 [  0 125   2]
 [  2   2  57]]
Overall score  0.9724409448818898
In [76]:
# The overall score increased to 97.2% when the number of principal components used is increased from 6 to 9
In [77]:
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    #Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy =  model.score(X_test, y_test)
    expected=y_test
    print("Iteration ",itr)
    itr=itr+1
    print()
    print("data split random state ",seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ",accuracy)
    print("----------------------------------------------------")
Iteration  1

data split random state  83
Classification report
              precision    recall  f1-score   support

           0       0.90      0.72      0.80        75
           1       0.87      0.94      0.91       123
           2       0.75      0.82      0.79        56

   micro avg       0.85      0.85      0.85       254
   macro avg       0.84      0.83      0.83       254
weighted avg       0.85      0.85      0.85       254

Confusion matrix
[[ 54   8  13]
 [  5 116   2]
 [  1   9  46]]
Overall score  0.8503937007874016
----------------------------------------------------
Iteration  2

data split random state  3
Classification report
              precision    recall  f1-score   support

           0       0.82      0.76      0.79        71
           1       0.88      0.87      0.88       123
           2       0.73      0.82      0.77        60

   micro avg       0.83      0.83      0.83       254
   macro avg       0.81      0.82      0.81       254
weighted avg       0.83      0.83      0.83       254

Confusion matrix
[[ 54   5  12]
 [ 10 107   6]
 [  2   9  49]]
Overall score  0.8267716535433071
----------------------------------------------------
Iteration  3

data split random state  98
Classification report
              precision    recall  f1-score   support

           0       0.82      0.83      0.82        64
           1       0.92      0.88      0.90       138
           2       0.74      0.81      0.77        52

   micro avg       0.85      0.85      0.85       254
   macro avg       0.83      0.84      0.83       254
weighted avg       0.86      0.85      0.86       254

Confusion matrix
[[ 53   1  10]
 [ 11 122   5]
 [  1   9  42]]
Overall score  0.8543307086614174
----------------------------------------------------
Iteration  4

data split random state  11
Classification report
              precision    recall  f1-score   support

           0       0.81      0.86      0.84        59
           1       0.88      0.90      0.89       134
           2       0.85      0.77      0.81        61

   micro avg       0.86      0.86      0.86       254
   macro avg       0.85      0.84      0.85       254
weighted avg       0.86      0.86      0.86       254

Confusion matrix
[[ 51   5   3]
 [  9 120   5]
 [  3  11  47]]
Overall score  0.8582677165354331
----------------------------------------------------
Iteration  5

data split random state  66
Classification report
              precision    recall  f1-score   support

           0       0.86      0.71      0.77        68
           1       0.90      0.92      0.91       125
           2       0.71      0.82      0.76        61

   micro avg       0.84      0.84      0.84       254
   macro avg       0.82      0.82      0.82       254
weighted avg       0.84      0.84      0.84       254

Confusion matrix
[[ 48   4  16]
 [  6 115   4]
 [  2   9  50]]
Overall score  0.8385826771653543
----------------------------------------------------
Iteration  6

data split random state  14
Classification report
              precision    recall  f1-score   support

           0       0.83      0.81      0.82        62
           1       0.87      0.90      0.88       135
           2       0.82      0.79      0.80        57

   micro avg       0.85      0.85      0.85       254
   macro avg       0.84      0.83      0.84       254
weighted avg       0.85      0.85      0.85       254

Confusion matrix
[[ 50   7   5]
 [  9 121   5]
 [  1  11  45]]
Overall score  0.8503937007874016
----------------------------------------------------
Iteration  7

data split random state  71
Classification report
              precision    recall  f1-score   support

           0       0.87      0.78      0.82        68
           1       0.86      0.91      0.88       124
           2       0.85      0.84      0.85        62

   micro avg       0.86      0.86      0.86       254
   macro avg       0.86      0.84      0.85       254
weighted avg       0.86      0.86      0.86       254

Confusion matrix
[[ 53  11   4]
 [  6 113   5]
 [  2   8  52]]
Overall score  0.8582677165354331
----------------------------------------------------
Iteration  8

data split random state  25
Classification report
              precision    recall  f1-score   support

           0       0.87      0.78      0.82        59
           1       0.88      0.88      0.88       130
           2       0.79      0.85      0.81        65

   micro avg       0.85      0.85      0.85       254
   macro avg       0.84      0.84      0.84       254
weighted avg       0.85      0.85      0.85       254

Confusion matrix
[[ 46   7   6]
 [  6 115   9]
 [  1   9  55]]
Overall score  0.8503937007874016
----------------------------------------------------
Iteration  9

data split random state  53
Classification report
              precision    recall  f1-score   support

           0       0.81      0.81      0.81        64
           1       0.89      0.87      0.88       135
           2       0.78      0.84      0.81        55

   micro avg       0.85      0.85      0.85       254
   macro avg       0.83      0.84      0.83       254
weighted avg       0.85      0.85      0.85       254

Confusion matrix
[[ 52   5   7]
 [ 12 117   6]
 [  0   9  46]]
Overall score  0.8464566929133859
----------------------------------------------------
Iteration  10

data split random state  93
Classification report
              precision    recall  f1-score   support

           0       0.87      0.88      0.87        67
           1       0.91      0.86      0.89       136
           2       0.72      0.82      0.77        51

   micro avg       0.86      0.86      0.86       254
   macro avg       0.84      0.85      0.84       254
weighted avg       0.86      0.86      0.86       254

Confusion matrix
[[ 59   2   6]
 [  9 117  10]
 [  0   9  42]]
Overall score  0.8582677165354331
----------------------------------------------------
In [78]:
# The highest Naive Bayes score is now about 85.8%, reached for random states 11, 71 and 93
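
Looping over hand-picked random states gives a rough feel for the split-to-split variance; scikit-learn's `cross_val_score` computes the same estimate in one call. A minimal sketch on synthetic 3-class data (the blob data and variable names are illustrative, not the projected vehicle features):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data standing in for the projected vehicle features
X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=7)

# One accuracy per fold; mean/std summarise the variability across splits
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5)
print("accuracy %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

Unlike the manual loop, every sample appears in a test fold exactly once, so the mean is less sensitive to a lucky split.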

Using all the principal components

In [79]:
P_reduce = np.array(eigvectors_sorted)   # taking all the dimensions

X_std_all = np.dot(X_std,P_reduce.T)   # projecting original data into principal component dimensions

Proj_data_df = pd.DataFrame(X_std_all)  # converting array to dataframe for pairplot
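
As a sanity check (not part of the original notebook), the manual route above (standardize, eigen-decompose the covariance matrix, project onto the sorted eigenvectors) matches what sklearn's `PCA.transform` does internally, up to the sign of each component. A small sketch on random data; all names here are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 5)
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # standardize

# Manual route: eigen-decompose the covariance matrix, project onto eigenvectors
eigvals, eigvecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(eigvals)[::-1]            # sort components by variance, descending
manual_proj = X_demo @ eigvecs[:, order]

# Library route
pca_proj = PCA(n_components=5).fit_transform(X_demo)

# Each principal axis is defined only up to sign, so compare absolute values
print(np.allclose(np.abs(manual_proj), np.abs(pca_proj), atol=1e-6))
```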
In [80]:
sns.pairplot(Proj_data_df, diag_kind='kde') 
Out[80]:
<seaborn.axisgrid.PairGrid at 0x1eda550d7b8>
In [81]:
from sklearn import model_selection

test_size = 0.30 # 70:30 train/test split
seed = 7  # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
In [82]:
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV

model = SVC()

params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}

model1 = GridSearchCV(model, param_grid=params, verbose=5)

model1.fit(X_train, y_train)

print("Best Hyper Parameters:\n", model1.best_params_)
C:\Users\Asus\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:2053: FutureWarning: You should specify a value for 'cv' instead of relying on the default value. The default value will change from 3 to 5 in version 0.22.
  warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.8383838383838383, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.8232323232323232, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.8826530612244898, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ..... C=0.01, kernel=rbf, score=0.5102040816326531, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.9292929292929293, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.8585858585858586, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.8877551020408163, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.8333333333333334, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.8282828282828283, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.7959183673469388, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.9343434343434344, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.8686868686868687, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.9387755102040817, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.9141414141414141, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.9242424242424242, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ....... C=0.5, kernel=rbf, score=0.923469387755102, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.9494949494949495, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ...... C=1, kernel=linear, score=0.898989898989899, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.9540816326530612, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9393939393939394, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9292929292929293, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.9387755102040817, total=   0.0s
Best Hyper Parameters:
 {'C': 1, 'kernel': 'rbf'}
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.2s finished
In [83]:
#Build the model with the best hyper parameters
svc_model = SVC(C=1, kernel="rbf")

# Fitting the model
svc_model.fit(X_train, y_train)

#Prediction on test set
prediction = svc_model.predict(X_test)

# Accuracy on test set
accuracy =  svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
Classification report
              precision    recall  f1-score   support

           0       0.97      0.98      0.98        66
           1       0.98      0.98      0.98       127
           2       0.97      0.97      0.97        61

   micro avg       0.98      0.98      0.98       254
   macro avg       0.97      0.98      0.97       254
weighted avg       0.98      0.98      0.98       254

Confusion matrix
[[ 65   1   0]
 [  1 124   2]
 [  1   1  59]]
Overall score  0.9763779527559056
In [85]:
# The overall score is 97.6%, almost identical to the score obtained with only 9 principal components. This shows that most of the information is captured by the first 9 components chosen earlier
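
That conclusion can also be read off directly from the cumulative explained-variance ratio, without retraining any classifier. A sketch on synthetic correlated data with 18 columns (mirroring the dataset's shape; the numbers themselves are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(7)
latent = rng.randn(500, 6)                            # 6 underlying factors
X_demo = latent @ rng.randn(6, 18) + 0.1 * rng.randn(500, 18)

# Fit PCA on standardized data and accumulate the variance ratios
pca = PCA().fit(StandardScaler().fit_transform(X_demo))
cum = np.cumsum(pca.explained_variance_ratio_)
print("variance captured by the first 9 components: %.3f" % cum[8])
```

Because the synthetic data has only 6 latent factors plus small noise, the curve flattens early, the same behaviour the notebook exploits when keeping 9 components.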
In [86]:
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(Proj_data_df, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    #Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy =  model.score(X_test, y_test)
    expected=y_test
    print("Iteration ",itr)
    itr=itr+1
    print()
    print("data split random state ",seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ",accuracy)
    print("----------------------------------------------------")
Iteration  1

data split random state  41
Classification report
              precision    recall  f1-score   support

           0       0.88      0.73      0.80        70
           1       0.84      0.95      0.89       131
           2       0.86      0.79      0.82        53

   micro avg       0.85      0.85      0.85       254
   macro avg       0.86      0.82      0.84       254
weighted avg       0.86      0.85      0.85       254

Confusion matrix
[[ 51  14   5]
 [  5 124   2]
 [  2   9  42]]
Overall score  0.8543307086614174
----------------------------------------------------
Iteration  2

data split random state  65
Classification report
              precision    recall  f1-score   support

           0       0.90      0.68      0.77        68
           1       0.82      0.93      0.88       121
           2       0.80      0.82      0.81        65

   micro avg       0.83      0.83      0.83       254
   macro avg       0.84      0.81      0.82       254
weighted avg       0.84      0.83      0.83       254

Confusion matrix
[[ 46  14   8]
 [  3 113   5]
 [  2  10  53]]
Overall score  0.8346456692913385
----------------------------------------------------
Iteration  3

data split random state  59
Classification report
              precision    recall  f1-score   support

           0       0.85      0.79      0.82        73
           1       0.86      0.94      0.90       126
           2       0.90      0.78      0.83        55

   micro avg       0.87      0.87      0.87       254
   macro avg       0.87      0.84      0.85       254
weighted avg       0.87      0.87      0.86       254

Confusion matrix
[[ 58  12   3]
 [  5 119   2]
 [  5   7  43]]
Overall score  0.8661417322834646
----------------------------------------------------
Iteration  4

data split random state  29
Classification report
              precision    recall  f1-score   support

           0       0.85      0.73      0.79        64
           1       0.88      0.95      0.91       133
           2       0.88      0.86      0.87        57

   micro avg       0.87      0.87      0.87       254
   macro avg       0.87      0.85      0.86       254
weighted avg       0.87      0.87      0.87       254

Confusion matrix
[[ 47  12   5]
 [  5 126   2]
 [  3   5  49]]
Overall score  0.8740157480314961
----------------------------------------------------
Iteration  5

data split random state  45
Classification report
              precision    recall  f1-score   support

           0       0.94      0.66      0.78        77
           1       0.80      0.91      0.85       122
           2       0.76      0.85      0.80        55

   micro avg       0.82      0.82      0.82       254
   macro avg       0.84      0.81      0.81       254
weighted avg       0.84      0.82      0.82       254

Confusion matrix
[[ 51  19   7]
 [  3 111   8]
 [  0   8  47]]
Overall score  0.8228346456692913
----------------------------------------------------
Iteration  6

data split random state  22
Classification report
              precision    recall  f1-score   support

           0       0.80      0.63      0.71        65
           1       0.77      0.89      0.82       123
           2       0.80      0.74      0.77        66

   micro avg       0.78      0.78      0.78       254
   macro avg       0.79      0.75      0.77       254
weighted avg       0.79      0.78      0.78       254

Confusion matrix
[[ 41  18   6]
 [  8 109   6]
 [  2  15  49]]
Overall score  0.7834645669291339
----------------------------------------------------
Iteration  7

data split random state  1
Classification report
              precision    recall  f1-score   support

           0       0.88      0.73      0.80        59
           1       0.85      0.96      0.90       133
           2       0.89      0.79      0.84        62

   micro avg       0.87      0.87      0.87       254
   macro avg       0.87      0.83      0.85       254
weighted avg       0.87      0.87      0.86       254

Confusion matrix
[[ 43  13   3]
 [  2 128   3]
 [  4   9  49]]
Overall score  0.8661417322834646
----------------------------------------------------
Iteration  8

data split random state  27
Classification report
              precision    recall  f1-score   support

           0       0.80      0.73      0.77        56
           1       0.85      0.90      0.87       145
           2       0.78      0.74      0.76        53

   micro avg       0.83      0.83      0.83       254
   macro avg       0.81      0.79      0.80       254
weighted avg       0.83      0.83      0.83       254

Confusion matrix
[[ 41  10   5]
 [  9 130   6]
 [  1  13  39]]
Overall score  0.8267716535433071
----------------------------------------------------
Iteration  9

data split random state  17
Classification report
              precision    recall  f1-score   support

           0       0.87      0.76      0.81        63
           1       0.87      0.93      0.90       135
           2       0.80      0.80      0.80        56

   micro avg       0.86      0.86      0.86       254
   macro avg       0.85      0.83      0.84       254
weighted avg       0.86      0.86      0.86       254

Confusion matrix
[[ 48   9   6]
 [  5 125   5]
 [  2   9  45]]
Overall score  0.8582677165354331
----------------------------------------------------
Iteration  10

data split random state  42
Classification report
              precision    recall  f1-score   support

           0       0.92      0.69      0.79        78
           1       0.86      0.95      0.90       118
           2       0.80      0.90      0.85        58

   micro avg       0.86      0.86      0.86       254
   macro avg       0.86      0.85      0.85       254
weighted avg       0.86      0.86      0.85       254

Confusion matrix
[[ 54  14  10]
 [  3 112   3]
 [  2   4  52]]
Overall score  0.8582677165354331
----------------------------------------------------
In [87]:
# The highest Naive Bayes score this time is 87.4%, for random state 29

What do we get if we don't use PCA?
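
Note that in the raw-data grid search below every rbf fold scores around 0.51, roughly the majority-class rate: with `gamma='auto'` on unscaled features the kernel distances blow up, so only the linear kernel survives. Standardizing inside a pipeline would fix that. A sketch on synthetic data with one badly scaled column (illustrative, not the vehicle data, whose raw features similarly differ in scale):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=7)
X_demo[:, 0] *= 1000.0   # one wildly scaled feature dominates the distances

# rbf kernel on raw data vs the same kernel after standardization
raw = cross_val_score(SVC(kernel="rbf", gamma="auto"), X_demo, y_demo, cv=5)
scaled = cross_val_score(
    make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto")),
    X_demo, y_demo, cv=5)
print("rbf on raw data   %.3f" % raw.mean())
print("rbf after scaling %.3f" % scaled.mean())
```

Standardization (which the notebook applied before PCA) is what rescues the rbf kernel, not PCA itself.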

In [88]:
# SVC
test_size = 0.30 # 70:30 train/test split
seed = 7  # random number seed for repeatability
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
#Grid search to tune model parameters for SVC
from sklearn.model_selection import GridSearchCV

model = SVC()

params = {'C': [0.01, 0.1, 0.5, 1], 'kernel': ['linear', 'rbf']}

model1 = GridSearchCV(model, param_grid=params, verbose=5)

model1.fit(X_train, y_train)

print("Best Hyper Parameters:\n", model1.best_params_)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.0s remaining:    0.0s
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.9191919191919192, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.9040404040404041, total=   0.0s
[CV] C=0.01, kernel=linear ...........................................
[CV] .. C=0.01, kernel=linear, score=0.9489795918367347, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ....... C=0.01, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.01, kernel=rbf ..............................................
[CV] ..... C=0.01, kernel=rbf, score=0.5102040816326531, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.9393939393939394, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] .... C=0.1, kernel=linear, score=0.898989898989899, total=   0.0s
[CV] C=0.1, kernel=linear ............................................
[CV] ... C=0.1, kernel=linear, score=0.9183673469387755, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ........ C=0.1, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ........ C=0.1, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.1, kernel=rbf ...............................................
[CV] ...... C=0.1, kernel=rbf, score=0.5102040816326531, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.9393939393939394, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.8939393939393939, total=   0.0s
[CV] C=0.5, kernel=linear ............................................
[CV] ... C=0.5, kernel=linear, score=0.9183673469387755, total=   0.1s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ........ C=0.5, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ........ C=0.5, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=0.5, kernel=rbf ...............................................
[CV] ...... C=0.5, kernel=rbf, score=0.5102040816326531, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.9444444444444444, total=   0.3s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.8838383838383839, total=   0.0s
[CV] C=1, kernel=linear ..............................................
[CV] ..... C=1, kernel=linear, score=0.9081632653061225, total=   0.1s
[CV] C=1, kernel=rbf .................................................
[CV] .......... C=1, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] .......... C=1, kernel=rbf, score=0.51010101010101, total=   0.0s
[CV] C=1, kernel=rbf .................................................
[CV] ........ C=1, kernel=rbf, score=0.5102040816326531, total=   0.0s
Best Hyper Parameters:
 {'C': 0.01, 'kernel': 'linear'}
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    1.5s finished
In [89]:
#Build the model with the best hyper parameters
svc_model = SVC(C=0.01, kernel="linear")

# Fitting the model
svc_model.fit(X_train, y_train)

#Prediction on test set
prediction = svc_model.predict(X_test)

# Accuracy on test set
accuracy =  svc_model.score(X_test, y_test)
expected=y_test
print("Classification report")
print(metrics.classification_report(expected, prediction))
print("Confusion matrix")
print(metrics.confusion_matrix(expected, prediction))
print("Overall score ",accuracy)
Classification report
              precision    recall  f1-score   support

           0       0.90      0.91      0.90        66
           1       0.93      0.94      0.93       127
           2       0.98      0.95      0.97        61

   micro avg       0.93      0.93      0.93       254
   macro avg       0.94      0.93      0.93       254
weighted avg       0.93      0.93      0.93       254

Confusion matrix
[[ 60   6   0]
 [  7 119   1]
 [  0   3  58]]
Overall score  0.9330708661417323
In [90]:
# Training SVC on all 18 independent variables gives an overall score of about 93%, which is lower than the score obtained when the model is trained on principal components.
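The raw-features-versus-principal-components comparison above can be sketched end to end with a `Pipeline`, so that scaling and PCA are fitted only on the training folds. This is a minimal illustration on synthetic stand-in data (the notebook's actual `X`, `y`, and split are not reproduced here); the choice of 9 components and `C=0.01` with a linear kernel mirrors the values found earlier in the notebook.

```python
# Sketch: SVC on raw (scaled) features vs. SVC on 9 principal components.
# Synthetic data stands in for the vehicle dataset's X and y.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=18, n_informative=9,
                           n_classes=3, random_state=0)

# Baseline: scale, then linear SVC on all 18 features
raw_model = Pipeline([("scale", StandardScaler()),
                      ("svc", SVC(kernel="linear", C=0.01))])

# PCA variant: scale, project onto 9 components, then the same SVC
pca_model = Pipeline([("scale", StandardScaler()),
                      ("pca", PCA(n_components=9)),
                      ("svc", SVC(kernel="linear", C=0.01))])

raw_score = cross_val_score(raw_model, X, y, cv=3).mean()
pca_score = cross_val_score(pca_model, X, y, cv=3).mean()
print("raw:", raw_score, "pca:", pca_score)
```

Wrapping the steps in a single pipeline also avoids leaking test-set statistics into the scaler or the PCA fit.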
In [91]:
model = GaussianNB()
iterationList=np.random.randint(1,100,10)
itr = 1
for i in iterationList:
    seed = i
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=test_size, random_state=seed)
    # Fitting the model
    model.fit(X_train, y_train)
    #Prediction on test set
    prediction = model.predict(X_test)
    # Accuracy on test set
    accuracy =  model.score(X_test, y_test)
    expected=y_test
    print("Iteration ",itr)
    itr=itr+1
    print()
    print("data split random state ",seed)
    print("Classification report")
    print(metrics.classification_report(expected, prediction))
    print("Confusion matrix")
    print(metrics.confusion_matrix(expected, prediction))
    print("Overall score ",accuracy)
    print("----------------------------------------------------")
Iteration  1

data split random state  76
Classification report
              precision    recall  f1-score   support

           0       0.94      0.24      0.38        71
           1       0.84      0.65      0.73       125
           2       0.38      0.91      0.54        58

   micro avg       0.59      0.59      0.59       254
   macro avg       0.72      0.60      0.55       254
weighted avg       0.76      0.59      0.59       254

Confusion matrix
[[17 12 42]
 [ 0 81 44]
 [ 1  4 53]]
Overall score  0.594488188976378
----------------------------------------------------
Iteration  2

data split random state  58
Classification report
              precision    recall  f1-score   support

           0       0.95      0.25      0.40        80
           1       0.81      0.66      0.72       122
           2       0.37      0.96      0.54        52

   micro avg       0.59      0.59      0.59       254
   macro avg       0.71      0.62      0.55       254
weighted avg       0.76      0.59      0.58       254

Confusion matrix
[[20 17 43]
 [ 1 80 41]
 [ 0  2 50]]
Overall score  0.5905511811023622
----------------------------------------------------
Iteration  3

data split random state  18
Classification report
              precision    recall  f1-score   support

           0       0.90      0.24      0.38        78
           1       0.76      0.66      0.71       115
           2       0.41      0.90      0.57        61

   micro avg       0.59      0.59      0.59       254
   macro avg       0.69      0.60      0.55       254
weighted avg       0.72      0.59      0.57       254

Confusion matrix
[[19 18 41]
 [ 2 76 37]
 [ 0  6 55]]
Overall score  0.5905511811023622
----------------------------------------------------
Iteration  4

data split random state  91
Classification report
              precision    recall  f1-score   support

           0       0.87      0.21      0.34        61
           1       0.82      0.61      0.70       134
           2       0.40      0.93      0.56        59

   micro avg       0.59      0.59      0.59       254
   macro avg       0.69      0.59      0.53       254
weighted avg       0.73      0.59      0.58       254

Confusion matrix
[[13 14 34]
 [ 2 82 50]
 [ 0  4 55]]
Overall score  0.5905511811023622
----------------------------------------------------
Iteration  5

data split random state  64
Classification report
              precision    recall  f1-score   support

           0       0.69      0.22      0.33        51
           1       0.86      0.62      0.72       128
           2       0.49      0.95      0.65        75

   micro avg       0.64      0.64      0.64       254
   macro avg       0.68      0.60      0.57       254
weighted avg       0.72      0.64      0.62       254

Confusion matrix
[[11 11 29]
 [ 3 80 45]
 [ 2  2 71]]
Overall score  0.6377952755905512
----------------------------------------------------
Iteration  6

data split random state  75
Classification report
              precision    recall  f1-score   support

           0       0.90      0.27      0.41        71
           1       0.85      0.66      0.74       120
           2       0.43      0.95      0.59        63

   micro avg       0.62      0.62      0.62       254
   macro avg       0.73      0.63      0.58       254
weighted avg       0.76      0.62      0.61       254

Confusion matrix
[[19 12 40]
 [ 1 79 40]
 [ 1  2 60]]
Overall score  0.6220472440944882
----------------------------------------------------
Iteration  7

data split random state  1
Classification report
              precision    recall  f1-score   support

           0       0.82      0.39      0.53        59
           1       0.87      0.63      0.73       133
           2       0.43      0.89      0.58        62

   micro avg       0.64      0.64      0.64       254
   macro avg       0.70      0.64      0.61       254
weighted avg       0.75      0.64      0.65       254

Confusion matrix
[[23  7 29]
 [ 4 84 45]
 [ 1  6 55]]
Overall score  0.6377952755905512
----------------------------------------------------
Iteration  8

data split random state  98
Classification report
              precision    recall  f1-score   support

           0       0.81      0.33      0.47        64
           1       0.88      0.68      0.77       138
           2       0.41      0.96      0.58        52

   micro avg       0.65      0.65      0.65       254
   macro avg       0.70      0.66      0.60       254
weighted avg       0.77      0.65      0.65       254

Confusion matrix
[[21 12 31]
 [ 4 94 40]
 [ 1  1 50]]
Overall score  0.6496062992125984
----------------------------------------------------
Iteration  9

data split random state  69
Classification report
              precision    recall  f1-score   support

           0       0.82      0.33      0.47        55
           1       0.85      0.66      0.74       132
           2       0.49      0.96      0.65        67

   micro avg       0.67      0.67      0.67       254
   macro avg       0.72      0.65      0.62       254
weighted avg       0.75      0.67      0.66       254

Confusion matrix
[[18 12 25]
 [ 4 87 41]
 [ 0  3 64]]
Overall score  0.6653543307086615
----------------------------------------------------
Iteration  10

data split random state  78
Classification report
              precision    recall  f1-score   support

           0       1.00      0.35      0.52        68
           1       0.92      0.68      0.78       120
           2       0.46      1.00      0.63        66

   micro avg       0.67      0.67      0.67       254
   macro avg       0.80      0.68      0.65       254
weighted avg       0.82      0.67      0.67       254

Confusion matrix
[[24  7 37]
 [ 0 81 39]
 [ 0  0 66]]
Overall score  0.6732283464566929
----------------------------------------------------
In [92]:
# The highest score for Naive Bayes without PCA is about 67%, which is much lower than the results obtained when PCA is applied first.
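The loop above estimates variability by re-splitting with random seeds; k-fold cross-validation gives the same kind of estimate more compactly, as a mean score with a standard deviation. A minimal sketch on synthetic stand-in data (the notebook's `X` and `y` are not reproduced here):

```python
# Sketch: 10-fold cross-validation as an alternative to looping
# over random train/test seeds for GaussianNB.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the vehicle dataset's X and y
X, y = make_classification(n_samples=300, n_features=18, n_informative=9,
                           n_classes=3, random_state=0)

scores = cross_val_score(GaussianNB(), X, y, cv=10)
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```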

Conclusions:

From the Principal Component Analysis we find that the optimum number of dimensions for building a model is 9, which agrees with the plot of eigenvalues sorted in descending order. From SVC and Naive Bayes we also see a considerable increase in model performance when the number of components is increased from 6 (93%) to 9 (97%), while the further gain from using all 18 dimensions is negligible. We also observed that model performance drops significantly when PCA is not used. Hence we infer that when there is strong correlation among the independent variables, PCA is a good choice: it captures the covariance structure, helps us choose an optimum number of dimensions, and improves model performance.
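The "optimum number of dimensions" criterion above can be made explicit with the cumulative explained-variance ratio: pick the smallest number of components that covers a chosen fraction of the total variance. A minimal sketch on synthetic stand-in data (the 95% threshold is an illustrative assumption, not a value from the notebook):

```python
# Sketch: choosing n_components from the cumulative explained-variance ratio.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vehicle dataset's 18 features
X, _ = make_classification(n_samples=300, n_features=18, n_informative=9,
                           random_state=0)

pca = PCA().fit(StandardScaler().fit_transform(X))
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose components explain at least 95% of the variance
n_components = int(np.searchsorted(cum_var, 0.95) + 1)
print("components for 95% variance:", n_components)
```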

In [ ]: